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Bayesian sequential parameter estimation with a Laplace 

type approximation 


Tiep Mai* Simon P. Wilson'l’ 


Abstract 

A method for sequential inference of the fixed parameters of a dynamic latent Gaussian 
models is proposed and evaluated that is based on the iterated Laplace approximation. The 
method provides a useful trade-off between computational performance and the accuracy of 
the approximation to the true posterior distribution. Approximation corrections are shown 
to improve the accuracy of the approximation in simulation studies. A population-based 
approach is also shown to provide a more robust inference method. 


1 Introduction 


This paper explores inference for the parameters of the latent linear Gaussian model: 

yt ^ (1) 

xt = Axt-i+ut, ( 2 ) 

where yt are conditionally independent noisy observations of Xt, Ut iV(0, S„) are independent 
Gaussian innovations and ip is the set of model parameters: A, S„ and any others associated 
with p{yt \xt,ip). They are an important class of dynamic state space models and have seen 
extensive applications in signal processing, epidemiology, environmental statistics, to name but a 
few fields. Statistical inference tasks for these models focus on learning about the latent process, 
and in some cases the model parameters, from observation of the yt- 

Nowadays, many statistical applications concern the analysis of streaming data for which fast 


algorithms are required e.g. dynamic classification (McCormick et al. 2011) or online analysis 


of sensor data (Hill et al. 2009, Yang et al. 2015). A desirable property in these applications is 


that the inference result can be efficiently re-used or modified when another observation arrives. 
As a result, traditional batch processing methods such as MCMC ( Roberts fc Rosenthal||2004 1, 
variational Bayes (Beal fc Ghahramani||2003 1, expectation propagation (MinkapOOl ) and INLA 
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( Rue et al.|2009 1 are not ideal solutions as the approximation must be re-computed ’from scratch’ 
when the next datum is observed. Furthermore, in the field of sequential inference, there have 
been quite many well-established works on the state variable inference such as variants of Kalman 
filter (Julier & Uhlmann 1997 Wan & Merwe 2000) and particle filter ( ]Pitt fc Shephard 1999 


Doucet et al. 2000). However, these algorithms do not completely fit in most practical applications 


which require not only the estimation of the state variable but also the unknown fixed parameters. 

Sequential parameter learning is non-trivial for several reasons: the non-regeneration charac¬ 
teristic of fixed parameters, the assumption of one-datum-at-a-time, the strong coupling between 
state variable and the unknown parameters, and the degeneracy issue of particle-based solutions. 


A comprehensive review of sequential parameter estimation is in Kantas et al. (2014). There 
are broadly two types of approach: maximum-likelihood-based methods try to obtain a point 
estimate of ip, and Bayesian methods that approximate the whole distribution p((p\yi:t); most 
solutions are based on particle filters. 

For Bayesian sequential parameter learning, which is the focus of this paper, |Liu fc West! 
(2001) uses kernel artificial dynamics for unknown parameters and maintains an invariant pa¬ 


rameter variance with a shrinkage factor. Fearnhead (2002), Storvik (2002) propose a method to 
use sufficient statistics to re-sample the unknown parameters, lessening the degeneracy effect of 


particle filters. Carvalho et al. (2010) further extends the idea by combining the sufficient statis¬ 
tics with auxiliary particle filters, with explicit inference for conditional dynamic linear models. 
There are also other Bayesian works such as practical filtering (Poison et al. 2008) and SMC^ 


(Chopin et al. 2013). However, as practical filtering uses lagged observations in fixed lagged 
approximations and SMC^ refreshes particle samples by MCMC with past data upon degeneracy 
encounter, they are not suitable for truly online scenario. 

All above methods are based on particle filters, which are susceptible to outlier and path 
degeneracy ( Andrieu et al.|[^005 Kantas et al.||2014 ). Hence, in this paper, we propose an alter¬ 
native approach for sequential inference, based on a functional approximation, in the expectation 
that a smoothed approximation of both state variable and unknown parameters is more robust 


to outliers than a discrete particle approximation. Like Liu & West (2001), the proposed method 
does not require the existence of sufficient statistics and hence can be applied in various cases 

The paper focuses on approximation otp{xt, p \ yi-t) 3xidp{xt+i, p \ yi-.t), where yi-,t = {?/i, •.., yt}- 
These approximations are derived as a generalisation of iterLap, an iterative Laplace approx¬ 
imation described in Bornkamp (2011), to a sequential setting. As the approximations are 
based on Gaussian mixtures, p is easily marginalised to yield approximations to p{xt \ yi-t) and 
p(xt+i \ yi:t)- Some approximation corrections by importance sampling and a population-based 


^The proposed method is less generalised than llJu &: West] | |200l] l due to the assumption of Gaussian linear 
state equation. Logically, with an extra computational cost, this constraint can be overcome either by a joint 
functional approximation p^xt+i, xt, \ yi-.t), or double functional approximations P(^xt, \ yi-.t), P(Xt+l, P \ yi:t) 
at each time point. However, this extension is not covered in this paper. 
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strategy of selective exploration are also developed, and shown to produce a more robust approx¬ 
imation. 

The paper is organised as follows. A brief description of the iterLap approximation is given in 
Section In Section the method is adapted for use in a sequential setting. The performance 
of the algorithm and the approximation corrections is evaluated in Section In Section a 
population-based approach is developed that adds further robustness to the method. There are 
some concluding remarks in Section]^ 


2 The iterLap approximation 


The Laplace approximation ( Tierney &: Kadane||1986 ) is one of the most commonly used func¬ 
tional approximations in statistical methods. For an non-normalised density q{x), it uses a second 
order Taylor expansion of log(( 7 (a:)) about a local maximum to obtain a Gaussian approximation 
q{x) = N(x I /r = ai, E = Q~^), where the mean a; is a local maximum of q and the precision 
matrix Q is the Hessian of log(g(a:)) evaluated at a; = a;. When a posterior distribution converges 
to the Gaussian asymptotically, the Laplace approximation can be very efficient. However, for 
non-Gaussian distributions, this approximation suffers from several shortcomings, of which we 
mention: it is a uni-modal approximation and so ignores other modes if the target density is 
multi-modal, it does not approximate highly skewed or heavy-tailed distributions well, and the 
Gaussian distribution implies a linear correlation between components of x and so cannot account 
for non-linear dependencies. 

The Laplace approximation can be generalised to a mixture of Gaussians approximation 

m 

Pm{x) = '^W^N{x\^li,Q~'^), 


with component weights Wi. Gelman et al. (2003, chap. 12) obtain the fii and Qi by simultaneous 
optimisations with different starting points, from which the Wi are evaluated by equating the 
target density q{x) to qm{x). The difficulties with this approach are that it is often inefficient 
when q{x) is a skewed uni-modal target density, as the optimisation solutions cluster around 
the unique global maximum of q{x) and produce a mixture density that is unimodal and almost 
symmetric. Another difficulty is the inefficiency in identifying local modes through random 
starting points. 


The iterated Laplace (iterLap) approximation, introduced in Bornkamp (20111, tries to over¬ 
come these difficulties by constructing the Gaussian mixture iteratively. Given an approximation 
with m components, another component is added to the mixture by constructing a Laplace ap¬ 
proximation for the difference between the target density and the approximation so far. This 
yields a new component mean fj-m+i and variance Sm+i ■ Weights for the new and already- 
existing components are then re-evaluated by quadratic programming. Further components are 
added until one of the stopping criteria for the algorithm is met. 
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2.1 Modifying the iterLap approximation for sequential inference 


Many modifications to the original iterLap approximation are possible. |Mai fc Wils"^ ( [Mst 
describes several that are implemented in this work for different parts of the algorithm, and 
have been shown empirically to give better performance when applied to latent Gaussian models. 
These include: modifying the stopping criteria, selecting initial optimisation points, scaling the 
Hessian matrix, and proposing alternative methods for re-assigning component weights. 

As an example. Figure shows the iterLap approximation of the posterior density of the 
precision parameters (A^, A^) in the Gaussian dynamic linear model : 


yt = xt +vt, 

Xt = Xt-l + Ut, 

where Ut ^ N{0,a'^ = A“^), Vt ^ N{0,a‘^ = A“^). The priors for precision parameters A„ 
and Xy are exponential with mean 2. One hundred observations are generated from the model 
with A* = 0.25, A* = 1. The posterior p{Xy,Xu | ymoo) can be evaluated up to a constant in 
closed form by marginalising out xi^oo, allowing comparison between the approximations and 
the target. The original iterLap approximation of [Bornkamp (2011|) and that of Mai & Wilson 


(2015) are applied to p{Tu,Ty \ y) where Tu = log(A„) and Ty = log(A„). The maximum number 

max is set to 30. 


of mixture components m„ 

The optimal iterLap approximation stops with 6 components and the modified iterLap with 


19 components. More comparison examples of two R implementations are given in (Mai & 


Wilson 2015), illustrating the trade-off between approximation accuracy and running time; in 


general the modified algorithm offers improved accuracy at a modest increase in computation 
time. The modified iterLap performs quite well in low-dimensional space but suffers from the 
curse of dimensionality and so performance deteriorates with the dimension of the target density. 
Still, the iterative and non-sampling nature of the algorithm means that it is faster than Monte 
Carlo based solutions to density approximation. Furthermore, these functional approximations 
can complement Monte Carlo methods by serving as non-linear multi-modal sampling proposals, 
providing an efficient way to explore complex parameter spaces. 


3 Extending iterLap for sequential learning 

3.1 Base algorithm 

Given a Gaussian mixture approximation to p{xt-i, p \ yi-.(t-i))i the properties of the Gaussian 
and the linearity of the state equation permit a closed-form expression for the prediction distri¬ 
bution p(xt,p\yi-.(t-i))- Using this, iterLap can be applied to obtain an approximation to the 
next filtering density p{xt, p \ yi-.t)- 
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(c) modified iterLap: (tu,Tv) (5.024 seconds) 


(d) modified iterLap: (Au, A«) (5.024 seconds) 


Figure 1: Left: the target posterior density contours (blue) and iterLap approximation (black) 
contours for p ( t „, Ty\y). Right: the approximation transformed back to (A„, A„). The red crosses 
are the iterLap component means. 


Assume that iterLap has provided a Gaussian mixture approximation to ip \ 


rrit-i 

i=l 

Each Gaussian component of p{xt-i,p \ yi-,(t-i)) is decomposed into a Gaussian marginal for 
If and a conditional Gaussian for Xt-i given p: 


Nix,.„p I = Nip I iQfYY^)Nix,., \ 




where: 




— 1 . 

\^i,(p(p > 1 
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where Qi*xt^_\xt-i’ Qi^xt-iv” relevant sub-matrices of Q-* and is a sub-matrix 

of 

Since the state equation is linear and Vt is Gaussian, so an approximation to the joint density 
of Xt, Xt-i and (p can be obtained: 


p{xt-i,xt,p\yi,(t-i)) = p[xt-i,p\yi:(t-i))p{xt\xt-i,(p) 


i^l 

( 3 ) 

Then Xt-i can be marginalised out to get the approximation to p{xt, p \ j/i:(t_i)) of Equation 


n 


p{xt,p\yi,{t-i)) = JPixt-i,xt,p\yi,(t-i)) dxt-i 

rrit-i 

= I / N{xt-i I I dx 

2 = 1 

/ 1 \ r 

= (27r)-‘^“=/2|Q„|i/2exp i^-x^Q^xtj ^ 


X \Qtx!l,w\^'^\QrJ exp 


-77^ O. 77. 

2 _i'^2,a:t_ 11(/?' 2,a;t —1 


( 4 ) 


with: 


Qi,ip 

Pi,Lp 


+ A QuXt)- 


Notice that even though it is more compact to write Equation [fusing the variance representation, 
evaluating p{xt, p \ yi-(t-i)) with the precision matrix is more efhcient computationally. 

An approximation (up to a constant) for p{xt, p \ yi-.t) is p{xt, p \ yi-.{t-i))pijjt | a:*, p)- IterLap 
is applied to this approximation to derive the Gaussian mixture approximation p(xt,p\yi:t)- 
This completes one cycle of the algorithm. A summary is given in Algorithm [l] 

Sincep(xi, p \ yi^t) is a Gaussian mixture, so are the marginal densitiesp(vj | yi-t) andp(a;t | yi^t) 
and easily derived. Also, log{p{xt,p\yi-,(^t-i))) is quadratic in xt, hence the conditional density 
p{xt \ yi:{t-i),p) is also a Gaussian mixture. However, log(p(x(, (^ | is not quadratic in 

p due to the presence of the determinant term and so p{p \ yi-^t-i)) is of more complicated form. 
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Algorithm 1 SIBS: Base algorithm of sequential inference with iterLap 

1. For < = 1, derive a Gaussian mixture approximation p{xi,ip | ?/i) to p{xi, ip \ yi) by modified 
iterLap: 


p{xi,p) « p{yi \xi,<p) p{xi,ip) oc p{xi,ip I yi). (5) 

2. For t = 2 : n, assuming a Gaussian mixture approximation p{xt-i, p \ yi:t-i) exists, then 
derive 

^xt, p I yi:(t-i)) = jp{xt I xt-i,p) p{xt-i,p I yt-i) dxt-i, (6) 

which is a non-Gaussian mixture and an approximation to p{xt^ p \ yi.,(t-i)). Then derive a 
Gaussian mixture approximation p{xt, p \ yi-t) to p{xt, p \ yi-t) by modified iterLap: 

p{xt,p I yi:t) « p{yt I Xt, p) p{xt, P I ?/l:(t-l)) « p{yt I Xt, p) p{xt, P I yi:(t-i)) fx p{xt, P I yi-.t)- 

(7) 


3.2 Error correction using EM 

To improve the sequential approximation, density correction is done by combining importance 
sampling with expectation maximization to obtain Algorithm Importance sampling is applied 
to the target density p{xt, p \ yi.t) with p{xt, p \ yi-t) as the proposal function. Then, expectation 


maximization is used to fit a Gaussian mixture to the re-sampled points, following Bishop (2006 


chap. 9), using p{xt, p \ yi-.t) as the initial solution. This algorithm is denoted SIEM. 

A simpler variant of Algorithm is just to fit a single Gaussian to replace Steps 3 to 5, 
with weighted mean and variance calculated from sampled values ■ ■ ■; ix[^\ p^^'>). 

We denote this approach as SIG (Algorithm 3). It is noted that SIG is not the simple Laplace 
approximation as it produces a single Gaussian that cover multiple components of iterLap ap¬ 
proximation, creating a ’’smoothed” version of the iterLap density. 


3.3 Example model and non-identifiability 

An example of the model defined in Equations [l] and which will be used in this paper, is: 

yt = a^+vt, (8) 

at = l{cixt -I- C2Zt -I- 5 > 0) (cixt -I- C2Zt + 5), (9) 

Xt = axt-i+ut, ( 10 ) 

where l(-) is the indicator function; Zt is a known exogenous time series generated by an iid 
Gaussian distribution zt A(0, cr^ = 0.5^) ; Ut, Vt are univariate iid Gaussian variables with 
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Algorithm 2 SIEM: Bias correction with expectation maximization 
For each t: 

1. Derive a Gaussian mixture approximation pjL{xt,ip \ yi-t) by modified iterLap as in Algo¬ 
rithm [2 

2. Sample from 

3. Calculate weights oc p{x["\ip^^'> \yi,t)/piLix[’‘\tp^'^^ \ yi-.t)', 

4. Re-sample M* values from ..., with weights to get re-weighted 

samples \ )); 

5. Apply the EM algorithm to find a Gaussian mixture fit pEAiixt, p \ yi-.t) to the 
with the same number of mixture components as p{xt, p \ yi-t)- 

6. Use the EM approximation pEmixt, p \ yi-t) as p(xt, p \ yi:t) in Algorithmand derive the 
next iterLap approximation (back to step 1). 


Ut ^ N{0,a1 = A“^), Vt ~ N{0,a^ = A“^); = log(A„) and r„ = log(A„). The full parameter 

vector is (a, Ci: 2 , t„, t„) or (a, Ci: 2 , <7^, (j„). Like many latent models, inference for the parameters 
and latent process in this case can suffer from non-identifiability in the parameters and latent 
process, and this has an impact on inference performance and how one assesses the performance 
of the approximation. In Bayesian methods, this is often seen in a posterior distribution as a 
posterior with multiple modes of the same weight, or a path of values whose posterior probabilities 
are very near the modal value. Formally, the type of non-identifiability under consideration is 
defined as follows. 

Definition 1 (Non-identifiability). A state space model has a non-identifiability J\fIT>{g a) when 
there is a transformation {q'atx'i) = f{gA,xi) such that p{yi-,n \ x'i, g'A, Qb) = p{yi-.n \ xi, gA, Qb) 
where gA C p and gs = p \ gA- 

Under this definition, the model of Equations[^to[l0|has 2 non-identifiability issues: Ml'D{ci,au) 
and NT'D{ci) (see Appendixfor details). We do not address this issue here, but reduce its 
impact for this model in two ways: place a prior on ci that favours one of the identifiable modes, 
or simply assume that one of parameters in the non-identifiable set is known. 

4 Examples 

All the examples in this section are based on the model of Equations i to with n = 5000 
data points and parameters: a* = 0.8, c* = 1.5, C 2 = —1, cr* = 0.3 (r* « 2.407) and a* = 10 
(t* « -4.605). 





4.1 Example 1 

It is assumed that a* and cr* are known or can be estimated offline; and the prior for the other 
unknown parameters <p = (o,ci,C 2 ) and the initial state xi is: 

p(xi,^,) = N(xi,ip I M = (0,0.5,1, -3), S = diag(22,0.O^, 1,1)). (11) 

The SIEM and SIG algorithms are run ten times on the same data so that differences between 
the runs are due to the importance sampling step of the algorithm. The iterLap approximation is 
used with 5 mixture components. The marginal filtering distributions of unknown parameters and 
state variable are plotted in Figuresandas a function of t for the SIEM and SIG approaches. 
The means with 2 standard deviation limits are shown. For clarity, time is plotted on a square 
root scale and truncated to the last 200 observations for xt- These plots show that Monte Carlo 
variation between runs is considerably larger for the Gaussian mixture approximation (SIEM) 
than for the single component approximation (SIG). Both appear to produce approximations 
with good marginal fit to the data, although with some possible bias in parameter estimation. 

Figure [^explores the quality of the approximation in more details. It shows the scaled joint 
model log probability log(p(j/i:t, xi-t, p))/t as a function of t, where Xi is the marginal mean with 
respect to PEAiixi, p \yi:i) and ip is the marginal mean with respect to PEMix 5 ooo,^\yi-. 500 o)- 
The figure shows that in many cases the SIEM and SIG algorithms infer a solution that is a better 
fit, in terms of model probability, than the true parameter values but that the algorithm tends 
not to move between possible solutions. This is a numerical counterpart to the non-identifiability 
issue mentioned in the previous section. 

To understand the difference between SIEM and SIG, and the ’’smoothing” effect of SIG, 
Figure plots the mixture components of the marginal density of Ci from PEMi^t, P \ Vi-.t) of a 
single SIEM run. It can be seen that the approximation SIEM eliminates a component that is 
very close to the true value of ci around t = 260. This happens because the approximation will 
try to best accommodate the latest observation, and so is vulnerable to outlier observations that 
leave large weight on components that do not fit past or future observations very well. On the 
other hand, if a single Gaussian component is used to cover multiple components in Figure 
around t = 200, the approximation will be more conservative and more robust to outliers. This 
explains the more consistent performance of the single component approximation SIG; this single 
component smooths out any local large change in the filtering distributions and so is more robust 
to outlier observations. 

Our experience is that the SIG approach produces more consistent and robust results across 
a whole set of examples that we have tried. Hence, the remainder of the paper will focus on SIG 
and its extensions. 
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Figure 2: SIEM approach: the marginal filtering distributions of unknown parameters and state 
variable for the example in Section |4.1| over 10 runs. The bright red curves represent the true 
values of unknown parameters and state variable. 

4.2 Example 2 

The previous example showed differences in the approximation across separate runs due to Monte 
Carlo error in the importance sampling. In this example, the sensitivity of the inference to the 
prior is explored. The state and observation precisions are also assumed unknown; the non- 
identifiability issue between them makes the inference very sensitive to the choice of prior, and 
this can increase the differences between runs. As mentioned earlier, the SIG approach is used 
as it has smaller Monte Carlo variation and more robust results. 

The unknown parameter vector is = (a, C2, r^, t„), with ci assumed known in order to avoid 
the non-identifiability mentioned in Section |3.3| Data are simulated with the same parameter 
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Figure 3: SIG approach: the marginal filtering distributions of unknown parameters and state 


variable for the example in Section 4.1 


values as Example 1 (a* = 0.8, = 1.5, cj = —1, r* = 2.407 and t* = —4.605). Two priors are 

used. The first is quite informative and is close to the true parameter values: 

= N{xi,ip I = (0,0.5,0,3, -4), = diag(22,0.9^, 1,0.5", 0.5")). (12) 

The second is less informative with a higher variance and a support further away from the true 
values (fi*: 

pixuip) = Nixi,ip I = (0,0.2, 0,5, -2), = diag(2", 0.9", 1,2", 2")). (13) 

The marginal filtering densities from the SIG approach with priors by Equations and are 
shown in Figures]^ and [^respectively. 
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Figure 4: The scaled joint log density log{p{yi-t,xi:t, p))/t versus time stamp in square root scale 
for the example in Section [4.1| The bright red curve corresponds with the evaluation at the true 
values of tp and xi. 



Figure 5: Mixture components of P£:m(ci | yi-.t) by the SIEM approach for the example in Section 


4.1 Each shaded colour corresponds to the two standard deviations around the mean of the 


different mixture components. 


Figurej^shows that SIG runs do converge to the true parameter values with the first informa¬ 
tive prior. However, inference results with the prior by Equation show significant differences, 
especially in Figures [7c| to [Tej The runs can be divided into two groups. In one group, the state 
equation precision is inferred to be smaller and the observation precision to be higher than 
the true values, which induces the inferred value of Xt to strongly track the observations. In the 
other group, is inferred to be larger and smaller, which leaves the inferred value Xt to stay 
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(f) \og{p{yi:t,Xl:t,<fi))/t 


Figure 6: SIG approach. Figures 6a to 6e show the marginal filtering distributions of unknown 
parameters and state variable for the example in Section [4.2| with prior in Equation |12[ Figure 
plots the scaled joint log density with the bright red curve corresponds with the evaluation at 
the true values. 
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Figure 7: SIG approach. Figures 7a to 7e show the marginal filtering distributions of unknown 
parameters and state variable for the example in Section [4.2| with prior in Equation |13[ Figure 
plots the scaled joint log density with the bright red curve corresponds with the evaluation at 
the true parameter values. 
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constant. Although both are reasonable local approximations, the first group is more interesting 
than the second in terms of interpretability. 

The scaled joint log densities for results of two priors are plotted in Figures [6^ and For 
the case of the second prior, Figure shows that the densities values of two discussed groups 
above are still higher than the one of true parameters values, hinting again at the problem of 
numerical non-identifiability for these local approximations. 


4.3 Comparison with pomp 


In this section, the SIG approach is compared with the bsmc function of R package pomp (King 


et al.|2015 ), which is an implementation of Liu and West’s method ( Liu &: West|2001 ). We choose 
this method as it has similar settings with our proposed method, a Bayesian solution with no 
assumption of sufficient statistics. 

We first test pomp-bsmc directly on the same data and prior of Section [4T] with 10 parallel 
runs each of which has 10000 particles. In this case, pomp-bsmc provides similar results to SIG’s 
of Figure illustrative sequential traces of parameter a by pomp-bsmc are given in Figure |8a] 

Then, both methods are tested for the robustness with respect to outliers. Three consecutive 
and independent outliers Ot= 2 i :23 N{0, a = 30) are added to the data y 2 i: 23 - Estimation results 

of parameter a by SIG and pomp-bsmc are shown in Figures [8b| and [Sc} As expected, particle- 
based pomp-bsmc decays quickly to few particles while SIG estimation, even also affected by 
outliers, is quite robust and maintains a wide standard deviation. 

The final test is with a different observation equation, yt = exp(at) -I- Vt- Again, Figures 8d 
and ^ show that SIG provides more stable results than pomp-bsmc 


5 Population-based strategies 

It is seen from the previous section that the SIG approach is capable of locally approximating 
the target filtering distributions but may be sensitive to the prior; this is seen especially in the 
example where both variances are unknown. In this section a method for implementing parallel 
runs of SIG is described that improves the identification of the principal mode of the target 
distribution and offers a pragmatic solution to the issue of specifying a prior that permits the 
algorithm to produce a good approximation. 

It is noted that any differences between independent runs of the SIG approximation of Section 
l^is due to the importance sampling, making it reasonable to select a run that has good perfor¬ 
mance and propagate that run forward, without weighting across all runs. This is analogous to 
optimisation with several starting values. 

^It is noted that pomp-bsmc has run-time error and stops halfway during the experiments. It is reported that 
this error is due to the degeneracy into few particles of pomp-bsmc. 


15 








CO 0- 


- 1 - 


- 2 - 



1000 


2000 3000 4000 5000 
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(d) pomp-bsmc: the case with yt = exp(at) -|- vt (e) SIG: the case with yt = exp(at) -|- vt 

Figure 8: Sequential estimation of a by SIG and pomp-bsmc. 


In this population-based strategy, N parallel runs of the SIG method are made and are re¬ 
sampled from time to time, according to their predictive performance. The objective is to identify 
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runs that are performing well and favour them in the re-sampling. “Bad” runs, that are stuck 
at a local mode that has poor consistency with the data, end up being rejected. The method 
will focus on exploration of the space of both state and unknown parameters that give small 
prediction error. At any time, each single run can be used as a proper best-effort approximation 
of the filtering distribution p{xt, p \ Xi-t). These runs can also be combined in a weighted average 
for a more robust sequential approximation. 

We define a general predictive performance function based on square error. Define a set of 
re-sampling times ri < r 2 < ■ ■ ■ rs and a maximum lag L at which to compute predictions. For 
any run h, at one of the re-sampling times t = r^, the cumulative square prediction error since 
the last re-sampling is examined for each lag I and component i oi yt- 

Vs 

SSEn-i,i= ^ Z = 0, ...,L; i= l,...,dim(yt), 

where ^ is the Fstep ahead prediction of yt^i of run n i.e. the prediction of yt^i made at 
time t — 1. The prediction y^.^ ^ can be obtained either by using Monte Carlo approximation 
of E,n{yt^i 1 2 /i:((_q) where the expectation is with respect to approximation p{xt-i, ^ \ yi-,(t-i)) of 
run h or a point prediction of yt^i at the approximated mean (xt-i^ip) of p(xt-i, ip \ J/i:(t-i))- To 
save the computation cost, point prediction is used in this paper. Note that I starts from zero as 
we also want to take into account the filtering as well as prediction error. 

The predictive performance measure is a weighted sum of these squared errors over lags 
0,1,..., L and the components of yp 

L dim(yt) 

Sn = Y. Y. (14) 

l—O 2—0 

Notice that both the cumulative square prediction error and predictive performance measure can 
be updated sequentially at each time step t by keeping previous predictions y[\ made at time 
it-l). 

At a re-sampling time, N runs are re-sampled with replacement from the existing N runs 
with probabilities: 

P(sample run n) oc exp(—iSfi), 

where Sn is the measure of Equation derived from the approximation of the nth run. 

The user-defined weights uii^i should both standardise the errors across components of yt and 
can reflect differing importances of predictions at different lags. In this paper, we use: 

wz 

9 5 

where ct is a rough estimation of sample standard deviation of yt^i of the first 50 observations 
and uji are the importances of each prediction lag. For all the subsequent experiments, we use 
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Algorithm 4 SIG-RS: Resampling parallel SIG runs 


1. Specify a number of parallel runs N, a maximum prediction lag L, a set of re-sampling 
times ri < r 2 < • • ■ < rs and a set of weights / = 0,..., L; i = 1,..., dim(j/t). 

2. For each t: 


(a) Update each run’s approximation to independently following the SIG 


algorithm of Section 3.2 


(b) Compute and store the Z-step ahead predictions from time t, for Z = 0,..., L; 

(c) Sequentially update Sn for each run n = 1, ..., TV, following Equation 

(d) If t = Ts from some s, re-sample N runs with replacement using probabilities propor¬ 
tional to exp(— 


14 


L = 1 and set (wq = 0.2, = 0.8) to put more importance weight to the one-step prediction 

error. Also, 

An extension of this algorithm also permits a pragmatic solution to the problem of vague or 
mis-specified priors in sequential analysis. In both cases, the algorithms of Sectioncan perform 
badly because the inference lacks good regularization from the prior and so, depending on what 
is observed, p{xt,p\yi-.t) may become degenerate or be caught at a bad local approximation 
from which it is very unlikely to escape even after a large sequence of observations. This is 
related to problems in other methods such as convergence of an MCMC algorithm or to the 
global solution in an optimisation problem. Since the SIG approach suffers from being trapped 
at a local approximation, in addition to the resampling step above, we propose a simple strategy 
called SIG-RSRP (Algorithm 5). 

In the SIG-RSRP algorithm, the SIG-RS algorithm is run but in addition, at each re-sampling 
time Ts, if the predictive error of the parallel runs is too small, then the variance of p(xt, tp \ yi-.t) 


is tempered to a larger, pre-set, value. This is analogous to simulated tempering in MGMG (Neal 


20011. Specifically, a set of threshold marginal variances for (xt, p) are pre-defined. At t = for 


some s, if all of the marginal posterior variances of p{xt, tp \ yi-.t) £^re smaller than their respective 
threshold then the variances in p{xt,<p\yi-.t) are re-set to these thresholds and the algorithm 
proceeds. 

This is equivalent to having run the algorithm with a different prior with larger variance. 
In this sense it is a pragmatic solution as it violates the a priori specihcation of the prior. It 
is noted that that, unlike SIG-RS, the SIG-RSRP method does alter the target distribution by 
tempering the prior. However, like SIG-RS, SIG-RSRP stops the switching and prior tempering 
when t > rs', hence, one can interpret SIG-RSRP as a sequential procedure to select a prior that 
was able to regularize the inference up to Z = rs before continuing with the usual SIG algorithm. 
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Both SIG-RS and SIG-RSRP are applied on the example of Sect ion [4?2| For SIG-RSRP, the re¬ 
sampling times are set to ri = 10 < r 2 = 25 < ra = 50 < r 4 = 75 < rs = 100 < ... < rg = 2000; 
and ^xt,(p is set to diag(0.3^, 0.3^, 0.25^, 0.25^, 0.5^), which is approximately a fraction of prior 

(p) 

variance T,x[,ip. 

Figures and [T0| show the filtering results of SlG-RS and SIG-RSRP respectively. Gompared 
to the SIG approach, the traces of SIG-RS and SIG-RSRP favour the group which has higher 
state variance, making the state variable more adapted to the observation due to the prediction- 
error criterion. Between these two, SIG-RSRP has more room to change due to the relaxation 
and there are definitely trends of SIG-RSRP traces moving closer to the true values over time. 

6 Conclusion 

This paper’s principal contributions are the following: 

• It has extended the iterated Laplace approximation to a sequential setting for joint param¬ 
eter and latent process inference in latent Gaussian models. This permits approximations 
with skewness, multi-modality and non-linear correlations. 

• It has developed two single Gaussian component approximations, one based on an impor¬ 
tance sampling correction and the other based on a population idea, to this joint posterior 
that are robust to outlier observations and a provide good performance in terms of posterior 
predictions. 

• The methods also demonstrate good computational performance as they are based on 
Gaussian approximations. 

Figure [TT] shows the chain of development of the 5 algorithms that are explored in this paper. 
Starting from Algorithm I, which can be viewed as the natural extension of iterLap to the 
sequential case, subsequent algorithms are derived to address some issue that arises with the 
previous algorithm in the chain, following evaluation on test data. 

The most interesting conclusion is that a single Gaussian approximation, obtained as a 
smoothed version of a Gaussian mixture, has been shown to have attractive properties in se¬ 
quential analysis. It must be emphasised that this is not the Laplace approximation but rather is 
derived as a smoothed version of a Gaussian mixture. The iterLap approximation is key to pro¬ 
viding a good Gaussian mixture approximation around which the single Gaussian can be built. 
An important difficulty that these approximations encounter is how to react to outlier observa¬ 
tions, also a cause of degeneracy in particle filter methods. It is here that the single Gaussian is 
shown to be robust. 

Non-identifiability of parameters is also an issue that causes difficulties for parameter estima¬ 
tion, and the population-based approach has been shown to provide a solution to this through 
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Figure 9: SIG-RS approach: the marginal filtering distributions of unknown parameters and 


state variable for the example in Section 4.2 with prior in Equation 13 
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Figure 10: SIG-RSRP approach: the marginal filtering distributions of unknown parameters and 


state variable for the example in Section 4.2 with prior in Equation 13 
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Figure 11: Summary of explored algorithms. 


allowing several different approximation runs to work in parallel, and then be combined in a 
sensible manner. 

Which of the 5 algorithms should one use? The final algorithm SIG-RSRP addresses the most 
of the issues that have arisen when implementing these approximations, and produces the most 
consistent and accurate approximation, but a subjective Bayesian may have reservations about 
using it as it is implicitly altering the prior following observation of data. Similarly, there may be 
some objections to the population idea that is also used in algorithm 4 (SIG-RS), as resampling 
the best approximations from an set of them could be seen as biasing the posterior uncertainty. 
On this basis, the recommendation is that algorithm 3 (SIG) is best for fully (subjective) Bayesian 
analysis, while SIG-RS and SIG-RSRP provide a pragmatic solution that performs well. 
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A Non-identifiability with the model of Equations to p!0 


Assuming that all parameters are unknown, the example model suffers from some cases of non- 
identihability that we know, as follows. 


Case 1. AfI'D{ci,au) 


Define c'l = cijj5] a'^ = x[ = j3xt and uj = Put for t = 1 : n with any /3 > 0. It can be 
easily seen that: 


at = '^{dix’t + C2Zt + 5 > 0) {c'lx't + C2Zt + 5), 

x't = ax^_^ +«;, 

and hence: 

P{yi-.n,X2,n I x[,a, c'l, C2, Cr(,, = piVl-.n, X2:n | 2^1, O, Cl, C2, Cr„, Cr^) 

P{yi:n^ X2:n I Xl^ U, Ci, C2, fT^,) (iX2-^ — Piyi'.Tn X2:n \ Xl ^ U, Cl, C2, Cy) dX2:n 

^ p{yi-n\x'i,a,c[,C 2 ,cr'^,cry) = p(yi:„ | Xi, a, Ci, C 2 , o-„, cr„) 

Case 2. The non-identifiability AfI'D{ci) can be proved easily with c'l = —ci; x't = —Xt for 
t = 1 : n. 

Depending on the model, the non-identifiability problem can become more severe. For ex¬ 
ample, if Zt is a series with identical and independent increment {zt — Zt-i ~ fV(0, •)), then 
there exists a new non-identifiability case MlT>{c 2 ,cru). Or if the model is extended with 
at = l{ciXt + C2Zt + C3 > 0) {ciXt C2Zt -I- C3) and {xt - Px) = a{xt-i - Px) + Ut, then there 
is JVT'D{c 3 , Px)- It is noted that all these cases can be combined, resulting in a very irregular 
likelihood p(yi-.n \xi,^) with multi-modality and narrow density ridges. 
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